(Semi)automated approaches to data extraction for systematic reviews and meta-analyses in social sciences: A living review

Background: An abundance of rapidly accumulating scientific evidence presents novel opportunities for researchers and practitioners alike, yet such advantages are often overshadowed by resource demands associated with finding and aggregating a continually expanding body of scientific information. Data extraction activities associated with evidence synthesis have been described as time-consuming to the point of critically limiting the usefulness of research. Across social science disciplines, the use of automation technologies for timely and accurate knowledge synthesis can enhance research translation value, better inform key policy development, and expand the current understanding of human interactions, organizations, and systems. Ongoing developments surrounding automation are highly concentrated in research for evidence-based medicine, with limited evidence surrounding tools and techniques applied outside of the clinical research community. The goal of the present study is to extend the automation knowledge base by synthesizing current trends in the application of extraction technologies of key data elements of interest for social scientists.

Methods: We report the baseline results of a living systematic review of automated data extraction techniques supporting systematic reviews and meta-analyses in the social sciences. This review follows PRISMA standards for reporting systematic reviews.

Results: The baseline review of social science research yielded 23 relevant studies.

Conclusions: When considering the process of automating systematic review and meta-analysis information extraction, social science research falls short as compared to clinical research, which focuses on automatic processing of information related to the PICO framework. With a few exceptions, most tools were either in the infancy stage and not accessible to applied researchers, were domain specific, or required substantial manual coding of articles before automation could occur. Additionally, few solutions considered extraction of data from tables, which is where key data elements that social and behavioral scientists analyze reside.


Introduction
Across disciplines, systematic reviews and meta-analyses are integral to exploring and explaining phenomena, discovering causal inferences, and supporting evidence-based decision making. The concept of metascience represents an array of evidence synthesis approaches which support combining existing research results to summarize what is known about a specific topic (Davis et al., 2014; Gough et al., 2020). Researchers use a variety of systematic review methodologies to synthesize evidence within their domains or to integrate extant knowledge bases spanning multiple disciplines and contexts. When engaging in quantitative evidence synthesis, researchers often supplement the systematic review with meta-analysis (a principled statistical process for grouping and summarizing quantitative information reported across studies within a research domain). As technology advances, in addition to greater access to data, researchers are presented with new forms and sources of data to support evidence synthesis (Bosco et al., 2017; Ip et al., 2012; Wagner et al., 2022).
Systematic reviews and meta-analyses are fundamental to supporting reproducibility and generalizability of research surrounding social and cultural aspects of human behavior; however, the process of extracting data from primary research is a labor-intensive effort, fraught with the potential for human error (see Pigott & Polanin, 2020). Comprehensive data extraction activities associated with evidence synthesis have been described as time-consuming to the point of critically limiting the usefulness of existing approaches (Holub et al., 2021). Moreover, research indicates that it can take several years for original studies to be included in a new review due to the rapid pace of new evidence generation (Jonnalagadda et al., 2015).

The need for this review
In the clinical research domain, particularly in Randomized Controlled Trials (RCTs), automation technologies for data extraction are evolving rapidly (see Schmidt et al., 2023). In contrast with the more defined standards that have evolved throughout clinical research domains, within and across the social sciences, substantial variation exists in research designs, reporting protocols, and even publication outlet standards (Davis et al., 2014; Short et al., 2018; Wagner et al., 2022). In health intervention research, targeted data elements generally include Population (or Problem), Intervention, Control, and Outcome (i.e., PICO; see Eriksen and Frandsen, 2018; Tsafnat et al., 2014). While experimental designs are considered a gold standard for translational value, many phenomena examined across the social sciences occur within contexts which necessitate research pragmatism in both design and methodological considerations (Davis et al., 2014).
Consider, for example, the field of Human Resource Development (HRD). In HRD, a primary research focus is on outcomes of workplace interventions intended to inform and improve areas such as learning, training, organizational development, and performance improvement (Shirmohammadi et al., 2021). While measuring intervention outcomes is a substantial area of discourse, HRD researchers have predominantly relied on cross-sectional survey data, and the most commonly employed quantitative method is Structural Equation Modeling (Park et al., 2021). Thus, meta-analyses are increasingly essential for supporting reproducibility and generalizability of research. In these fields, data elements targeted for extraction would rarely align with the PICO framework; rather, meta-analytic endeavors would entail extraction of measures such as effect sizes, model fit indices, or instrument psychometric properties (Appelbaum et al., 2018).
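For illustration, such elements are typically reported as APA-style statistic strings in running text. The following minimal Python sketch (our own hypothetical patterns, not a tool identified by this review) shows how a simple rule-based approach might flag candidate effect sizes, fit indices, and reliability coefficients:

```python
import re

# Hypothetical, simplified patterns for APA-style report strings; production
# extraction systems are considerably more robust than these examples.
APA_PATTERNS = {
    "correlation_r": re.compile(r"\br\s*=\s*-?\.\d+"),
    "cohens_d":      re.compile(r"\bd\s*=\s*-?\d*\.\d+"),
    "fit_cfi":       re.compile(r"\bCFI\s*=\s*\.\d+"),
    "reliability":   re.compile(r"(?:Cronbach'?s\s+)?alpha\s*=\s*\.\d+"),
}

sentence = "The model fit well (CFI = .96), and the scale was reliable (alpha = .89)."
for label, pattern in APA_PATTERNS.items():
    for match in pattern.finditer(sentence):
        print(label, "->", match.group())
```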

Related research
Serving as a model for the present study, Schmidt et al. (2023) are conducting a living systematic review (LSR) of tools and techniques available for (semi)automated extraction of data elements pertinent to synthesizing the effects of healthcare interventions (see Higgins et al., 2022). Exploring a range of data-mining and text classification methods for systematic reviews, the authors found that early, often-employed approaches (e.g., rule-based extraction) gave way to classical machine learning (e.g., naïve Bayes and support vector machine classifiers); more recently, trends indicate increased application of deep learning architectures such as neural networks and word embeddings (for yearly trends in reported systems architectures, see Schmidt et al., 2021, p. 8).
In the social sciences and related disciplines, several related reviews of tools and techniques for automating tasks associated with systematic reviews and meta-analyses have been conducted. Table 1 provides a summary of related research.
Based on extant reviews analyzing trends in Artificial Intelligence (AI) technologies for automating Systematic Literature Review (SLR) efforts outside of clinical domains, we noted several trends. First, techniques to facilitate abstraction, generalization, and grouping of primary studies represent the majority of (semi)automated approaches. Second, extant reviews highlight a predominant focus on supporting search and study selection stages, with significant gaps in (semi)automating data extraction. Third, evaluation concerns underscore the importance of performance metrics, validation procedures, benchmark datasets, and improved transparency and reporting standards to ensure the reliability and effectiveness of AI techniques. Finally, challenges in cross-discipline transferability illuminate the need for domain-specific adaptations and infrastructures.
Existing reviews evidence the widespread application of techniques such as topic modeling, clustering, and classification to support abstraction, generalization, and grouping of primary research studies. Topic modeling, particularly Latent Dirichlet Allocation (LDA), is commonly applied to (semi)automate content analysis, facilitating the distillation of complex information into meaningful insights and identification of overarching trends and patterns across a literature corpus (Antons et al., 2020; Dridi et al., 2021; Roldan-Baluis et al., 2022; Yang et al., 2023). Additionally, classification and clustering techniques are commonly applied for tasks such as mining article metadata and automatically grouping papers by relevance to SLR research questions (Feng et al., 2017; Sundaram & Berleant, 2023; Wagner et al., 2022).
(Semi)automation efforts in the social sciences and related disciplines have primarily addressed supporting the search and study selection stages of SLRs (Cairo et al., 2019; Feng et al., 2017), with significant gaps in automation techniques for tasks such as data extraction (Göpfert et al., 2022; Sundaram & Berleant, 2023). Further, available software tools lack functionality to support activities beyond study selection (Kohl et al., 2018). Key findings across these reviews underscore the need for more comprehensive automation solutions, particularly for quantitative data extraction (Göpfert et al., 2022).
Additionally, researchers express transparency concerns regarding AI's reliance on black-box models (Wagner et al., 2022) and limited visibility into underlying processes and algorithms in proprietary software solutions (Antons et al., 2020). Adding to these considerations, Antons et al. (2020) identified substantial reporting gaps, including 35 of 140 articles omitting details about the software used. Since metrics alone may not be sufficient to objectively assess AI performance (Dridi et al., 2021), strategies for mitigating bias and ensuring transparency and fairness represent a substantial topic of automation discourse.
Ongoing research of AI tools for clinical studies (Sundaram & Berleant, 2023) and the extraction of PICO data elements from RCTs (Wagner et al., 2022) underscore the success of domain-specific adaptation efforts. While the promise of adopting AI-based techniques and tools in social science domains is evident (Cairo et al., 2019; Feng et al., 2017), extant research reveals challenges in transferring existing technologies across disciplines. Further, many SLR software applications are tailored specifically for health and medical science research (Kohl et al., 2018). The literature suggests that overcoming global obstacles can be facilitated by concentrated efforts to develop domain-specific knowledge representations, such as standardized construct taxonomies and vocabularies (Feng et al., 2017; Göpfert et al., 2022; Wagner et al., 2022).

Objectives
In the present study, we conduct a baseline review of existing and emergent techniques for (semi)automated data extraction that focus on target data entities and elements relevant to evidence synthesis across social science research domains. This review covers data extraction tools for a range of data types, both quantitative and qualitative. Per the research protocol, social sciences categories included in this review were based on the branches of science and academic activity boundaries described by Cohen (2021; Chapter 2). Additional description is available in the project repositories; see the 'Data availability' section. We report findings that supplement the growing body of research dedicated to the automatic extraction of data from clinical and medical research.

Protocol registration
This LSR was conducted following a pre-registered and published protocol (Legate & Nimon, 2023b). For additional details and project repositories, see the 'Data availability' section.

Living review
We adopted the LSR methodology for this study primarily due to the pace of emerging evidence, particularly in light of ongoing technological advancements. The ongoing nature of an LSR allows for continuous surveillance, ensuring timely presentation of new information that may influence findings (Elliott et al., 2014, 2017; Khamis et al., 2019). This baseline review was initiated upon peer approval of the associated protocol (Legate & Nimon, 2023b). It remains our intent for the review to be continually updated via living methodological surveys of published research (Khamis et al., 2019) following the workflow schedule as previously published in the protocol (see Figure 1; Legate & Nimon, 2023b). Necessary adjustments to the workflow will be detailed within each subsequent update.

Eligibility criteria
As in prior reviews, English-language reports published in 2005 or later were considered for inclusion (Jonnalagadda et al., 2015; O'Mara-Eves et al., 2015; Schmidt et al., 2020). Eligible studies utilized, presented, and/or evaluated (semi)automated approaches to support evidence-synthesis research methods (e.g., systematic reviews, psychometric meta-analyses, meta-analyses of effect sizes).

Search sources
The search strategy for this review was adapted from that of a related LSR of clinical research (Schmidt et al., 2020).
We initially intended to conduct searches in the same databases used by Schmidt et al. (2020, 2021).

Study selection
Title, abstract, and full-text screening was conducted using Rayyan (Ouzzani et al., 2016; free and subscription accounts available at https://www.rayyan.ai/). Three researchers (1,000 abstracts per week) screened all titles and abstracts.
Researchers met weekly to review, resolve conflicts, and further develop the codebook for this LSR. All conflicts that arose during the title and abstract screening (n=103/N=10,644) were resolved on a weekly basis. Where disagreements arose, they were related to methods for abstractive text summarization and transferability of methods applied to clinical research studies (i.e., RCTs). In cases where level of abstraction and potential for transferability could not be determined from the abstract alone, full-text articles were reviewed and discussed by all three researchers until consensus was reached.
For the data extraction stage, a Google form was developed following items of interest as described in the protocol. All data extraction tasks were performed independently in triplicate. Researchers met weekly to review and reach a consensus on coding of extracted items of interest. The extraction form was updated over the course of data extraction to better fit project goals and promote reliability of future updates.
We originally intended to conduct Inter-Rater Reliability (IRR) assessments to provide reliability estimates following each stage of the baseline review (Belur et al., 2018; Zhao et al., 2022). Given the nascency of our research and the scope of our items of interest, coding forms allowed for input of "other" responses (e.g., APA data elements) that were not included in extant reviews that focus on medical and clinical data extraction (e.g., PICO elements). Further, data extraction presented opportunities to develop a reporting structure for methods and items of interest that were not reported in prior literature (e.g., NER, open-source tools). A weekly review meeting was used to continually develop the project codebook to promote continuity and structure and to develop an IRR framework for future iterations of this review.

Search results
Search results are presented in the PRISMA flowchart (see Figure 2). A total of 11,336 records were identified through all search sources, including databases and publications available through the Systematic Review Toolbox (Marshall et al., 2022). After deduplication, 10,644 articles were included in the title and abstract screening stage. We retrieved 46 articles for full-text screening. One duplicate print was detected during full-text screening and was removed. This iteration of the LSR includes 23 articles. Detailed descriptions of deduplication and preliminary screening procedures are available in the OSF project repository (see 'Data availability' section).
The following sections describe the rationale for exclusions, followed by a brief overview of studies included in the baseline review. These results are presented in Figures 3 and 4, respectively. An overview of included studies is presented in Table 2.

Excluded publications
Most studies were excluded due to lack of detail in extracted data entities (n=7) and wrong corpus or data source (n=7). Carrión-Toro et al. (2022), for example, developed a method and software tool supporting researchers with selection of relevant key criteria in a field of study based on term frequencies. While text summarization has proven valuable for evidence synthesis tasks, the primary focus of this LSR involves efforts to extract specific data points from primary research (O'Connor et al., 2019). We also excluded extraction techniques that were not applied to abstracts or full text of research articles. Ochoa-Hernández et al. (2018), for instance, presented a method to automatically extract concepts from web blog articles.
The second most common exclusion category was articles that presented techniques or systems utilizing pre-extracted data (n=4). Ali and Gravino (2018), for example, proposed an ontology-based SLR system with semantic web technologies; however, the data (derived from a prior review conducted by the authors) were added to the ontology system after the manual extraction stage. Finally, articles were excluded due to exclusive application in medical/clinical research (n=2), or because the proposed tool had not yet been implemented (n=2). Goswami et al. (2019), for example, described and evaluated a supervised ML framework to identify and extract anxiety outcome measures from clinical trial articles. Zhitomirsky-Geffet et al. (2020) presented a conceptual description of a network-based data model capable of mining quantitative results from social sciences articles, but the system had not been implemented at the time of publication.

Included publications
The majority of included studies (n=12) presented or described a software tool, system, or application to support researchers extracting data from research literature. The second most common inclusion category focused on the development of specialized techniques or methods for automating data extraction tasks (n=9). We identified two studies that evaluated or tested the performance of existing tools or methods for (semi)automated data extraction. Unlike related reviews of data extraction methods for healthcare interventions (see Schmidt et al., 2023), we did not identify social science studies applying existing automated data extraction tools to conduct secondary research.

Automated approaches
To report the approaches identified, we organized the extracted data under four overarching categories: (1) data preprocessing and feature engineering, (2) model architectures and components, (3) rule-bases, and (4) evaluation metrics. See the 'Data Availability' section for labeling and additional descriptions of techniques. We opted to extract and report rule-based techniques separately because the approaches we identified intertwined with various aspects of the data processing and extraction pipeline, spanning data preprocessing to the model architecture itself. This distinction allows for more discussion about the prevalence, scope, and utility of these techniques.

Data preprocessing and feature engineering
The data preprocessing category encompasses methods and techniques used to preprocess raw text and data before it is fed into ML/NLP models. This includes tasks such as tokenization, stemming, lemmatization, stop word removal, and other steps necessary to clean and prepare the text data for analysis. Figure 5 plots the aggregate results of preprocessing techniques identified.
Nearly all studies applied tokenization and/or segmentation (83%, n=19) for breaking down text into manageable units.
Similarly, PDF parsing/extraction techniques were applied in 65% (n=15) of studies; the remaining studies applied extraction to other document formats (e.g., journal articles available online in HTML format; see Diaz-Elsayed & Zhang, 2020). Related methods that additionally take into account syntactic structure, including chunking and dependency parsing, were less frequently applied (Angrosh et al., 2014; Li et al., 2022; Nayak et al., 2021; Pertsas & Constantopoulos, 2018). Tagging methods, including PoS tagging (assigning grammatical categories, e.g., noun, verb), followed by concept tagging (e.g., semantic annotation), or sequence tagging, where labels were assigned based on order of appearance, were used in 43% (n=15) of studies. Nine studies used manual annotation for training and/or evaluation.
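As a concrete illustration of these preprocessing steps, the following Python sketch applies tokenization, PoS tagging, stop word removal, and lemmatization with the open-source NLTK library; it is a minimal example of the techniques named above, not a pipeline reported by any included study:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources.
for resource in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(resource, quiet=True)

text = "Results indicated a significant effect of training on performance."

tokens = nltk.word_tokenize(text)   # tokenization/segmentation
tagged = nltk.pos_tag(tokens)       # PoS tagging (e.g., NN, VBD)

stops = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
cleaned = [
    lemmatizer.lemmatize(token.lower())                # lemmatization
    for token in tokens
    if token.isalpha() and token.lower() not in stops  # stop word removal
]

print(tagged[:4])
print(cleaned)
```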
Feature engineering (e.g., ranking functions, representation learning, and feature extraction techniques) covers a range of methods essential for transforming raw text data into structured, machine-readable representations to facilitate downstream ML/NLP tasks (Kowsari et al., 2019). See Figure 6.
Word embeddings were the most frequently used techniques. We grouped ELMo (word embeddings from language models) with traditional word embeddings such as Word2Vec and GloVe (Kowsari et al., 2019; Young et al., 2018).
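To make the embedding step concrete, the sketch below trains a small skip-gram Word2Vec model with the open-source gensim library on a toy tokenized corpus; the corpus and hyperparameters are illustrative assumptions, not values drawn from the included studies:

```python
from gensim.models import Word2Vec

# Toy corpus: tokenized sentences of the kind found in methods/results sections.
sentences = [
    ["effect", "size", "was", "estimated", "using", "cohen", "d"],
    ["structural", "equation", "modeling", "tested", "model", "fit"],
    ["cronbach", "alpha", "indicated", "acceptable", "scale", "reliability"],
]

# Train a skip-gram (sg=1) model; each token is mapped to a dense vector.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

vector = model.wv["effect"]                 # 50-dimensional embedding
neighbours = model.wv.most_similar("model") # nearest tokens in vector space
print(vector.shape, neighbours[:3])
```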

Model architectures and components
The model architecture category focuses on the architectures and components of ML/NLP models used for data extraction. Results are shown in Figure 7. Some approaches overlapped across applications (e.g., semantic web or semantic indexing structures and ontology-pipeline approaches); we grouped these techniques into categories to facilitate reporting. Likewise, all transformer-based approaches were grouped into a single category; however, specific architectures and components are discussed in the sections below, and detailed coding of extracted data is available in the supplemental data files (see 'Underlying Data' section). Where rule-based approaches were considered a part of the system architecture, they are reported under the 'Rule-bases' section.
Overall, approaches ranged from straightforward implementations to complex layered architectures. Examples of more straightforward approaches included architectures based entirely on rule-bases (e.g., Diaz-Elsayed & Zhang, 2020), applications based on one classification method (e.g., naïve Bayes; Neppalli et al., 2016), or those utilizing a single type of probabilistic model (Angrosh et al., 2014; Iwatsuki et al., 2017). At the other end of the complexity continuum, Nowak and Kunstman (2019) presented an end-to-end deep learning model based on a BiLSTM-CRF architecture with interleaved alternating LSTM layers and highway connections. In the following sections, we further elaborate on the various approaches identified.
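As a minimal sketch of the simpler end of this continuum, the following example trains a naïve Bayes sentence classifier with the open-source scikit-learn library to flag sentences likely to contain extractable data elements; the toy sentences and labels are hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled sentences: 1 = contains a reportable data element.
sentences = [
    "The intervention yielded d = 0.48, p < .01.",
    "Model fit was acceptable, CFI = .95, RMSEA = .04.",
    "Participants were recruited from three universities.",
    "Prior literature on workplace learning is extensive.",
]
labels = [1, 1, 0, 0]

# TF-IDF features feeding a multinomial naïve Bayes classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(sentences, labels)

print(clf.predict(["Reliability was high, alpha = .91."]))
```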
Ontology-based and Semantic Web. These pipelines involve leveraging ontologies and semantic web technologies for semantic annotation or knowledge representation. Among included studies, ontology and semantic web capabilities were explored as early as 2014, but the preliminary results from this baseline review suggest an upward trend in recent years.
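A minimal sketch of this style of pipeline, using the open-source Python rdflib library and a hypothetical namespace (not an ontology from any included study), might represent an extracted element as RDF triples and retrieve it with a SPARQL query:

```python
from rdflib import Graph, Literal, Namespace, RDF

# Hypothetical namespace standing in for a domain ontology.
EX = Namespace("http://example.org/slr#")

g = Graph()
g.bind("ex", EX)

# Link a primary study to an extracted data element via triples.
paper = EX["paper_042"]
g.add((paper, RDF.type, EX.PrimaryStudy))
g.add((paper, EX.reportsEffectSize, Literal("d = 0.48")))

# SPARQL query retrieving every effect size recorded in the graph.
query = """
    SELECT ?paper ?value
    WHERE { ?paper <http://example.org/slr#reportsEffectSize> ?value }
"""
for row in g.query(query):
    print(row.paper, row.value)
```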

Evaluation metrics
Evaluation metrics are presented in Figure 9. Precision, recall, F-scores, and accuracy were predominantly reported across studies, including the earliest published articles. For assessment of model performance, six studies used cross-validation (CV), a process of "averaging several hold-out estimators of the risk corresponding to different data splits" (Arlot & Celisse, 2010, p. 53). K-fold CV (5 or 10 folds) was predominantly applied (Angrosh et al., 2014; Iwatsuki et al., 2017; Neppalli et al., 2016; Shen et al., 2022).
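The following sketch illustrates this style of evaluation with the open-source scikit-learn library, computing precision, recall, F1, and accuracy under 5-fold CV on synthetic data (a stand-in for real extraction features and labels):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for sentence-level features and extraction labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 5-fold CV, averaging each metric over the held-out folds.
scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    scoring=("precision", "recall", "f1", "accuracy"),
)
for metric in ("test_precision", "test_recall", "test_f1", "test_accuracy"):
    print(metric, round(scores[metric].mean(), 3))
```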

Availability, accessibility and transferability
While only one study we reviewed presented an existing tool that was accessible to users through an online application (sysrev.com; Bozada et al., 2021) at the time of conducting this baseline review, two other studies presented tools that were either being prepared or available through other means. These included the Holistic Modifiable Literature Review tool (Denzler et al., 2021); the authors of a second tool (2020) affirmed the domain-independent nature of their framework, suggesting its suitability for various systematic reviews.
Additionally, other studies highlighted the need for transferability and discussed the potential for their research tools and technologies to be extended and adapted across varying domains, stressing the importance of flexible design principles in the development of these tools (Angrosh et al., 2014; Diaz-Elsayed & Zhang, 2020). Angrosh et al. (2014) explained how SENTCON's preliminary design was applied to a specific set of articles in computer science but emphasized that the tool was flexible enough to be applied to other domains through the use of the Web Ontology Language (OWL). Diaz-Elsayed and Zhang (2020) presented methods that were initially applied to wastewater-based resource recovery, but likewise emphasized that the tool was capable of evaluating other engineered systems and retrieving different types of data than those initially extracted.
As noted by Chen et al. (2021), while efforts are being made to assist the process of conducting systematic reviews, there is often limited generalizability of domain-specific pre-trained language models. Many studies included in our review dedicated discussion points toward addressing the critical issue of generalizability and transferability of tools developed to support the broader research community in (semi)automated data extraction tasks. Collectively, these studies suggest a positive trend toward the development of adaptable, transferable research tools and technologies. However, they also underscore the need for ongoing effort across and between diverse domains to make continued progress toward broader research applications.

APA data elements
This section discusses the potential for extraction of key data elements of interest, as well as the locations (i.e., paper sections), structures, and review tasks addressed by the studies reviewed. We limited this section to reporting tools that users could theoretically access and use to support their own research projects. There were 12 studies that presented systems or artifacts designed to facilitate various tasks associated with identifying and extracting data from published literature.
To avoid speculating as to the future availability of these tools, we included all studies which presented tools or systems where the authors incorporated user interfaces (UIs), regardless of availability at the time of conducting this baseline review.

Structure, location, and review tasks
Table 5 provides an overview of the structure and location of extracted data elements, followed by the review tasks supported by the tools identified. The majority developed approaches for (semi)automating extraction of data from any section of full-text research articles. Two studies tested the proposed techniques on specific article sections, including titles and abstracts. Regarding the structure from which data were extracted, all except one extracted from unstructured text, two extracted from both tabular structures (i.e., tables) and text (Nayak et al., 2021; Pertsas & Constantopoulos, 2018), and one was designed specifically to extract elements from tables (TableSeer; Liu et al., 2007).
All tools focused heavily on tasks related to data extraction (e.g., identification, labeling/annotation, ontology population), which was anticipated based on our search strategy and inclusion criteria. However, several studies advanced solutions for supporting other SLR tasks or stages (see Tsafnat et al., 2014). The most common task (excluding data extraction) was literature search (Bayatmakou et al., 2022; Bozada et al., 2021; Chen et al., 2020; Denzler et al., 2021; Liu et al., 2007; Pertsas & Constantopoulos, 2018). Many tasks listed are likely supported by a range of computational tools and techniques (e.g., synthesize and meta-analyze results); readers interested in (semi)automating other SLR stages are referred to the Systematic Review Toolbox for an extensive catalogue of tools and methods (Marshall et al., 2022).

Challenges
A number of challenges were reflected within the body of evidence included in this baseline review. These challenges included difficulties in identifying functional structures within unstructured texts (Shen et al., 2022), extracting data from PDF file sources (Nayak et al., 2021; Goldfarb-Tarrant et al., 2020; Iwatsuki et al., 2017), and accurately detecting in-line mathematical expressions (Iwatsuki et al., 2017). Computational complexity created another significant obstacle for researchers, with issues arising from text vectorization methods, optimization problems, and the computational resources required by neural network frameworks (Bayatmakou et al., 2022; Anisienia et al., 2021). Furthermore, challenges associated with annotation, particularly biases introduced through automated processes and limitations of available datasets, were a topic of discourse (Li et al., 2022; Nowak & Kunstman, 2019; Torres et al., 2012).
Compared to the medical field, domain-specific challenges, particularly those in the social sciences and related fields, necessitated tailored approaches, which can become time-consuming as researchers often lack sufficient training data to develop robust classifiers (Chen et al., 2021; Aumiller et al., 2020; Zielinski & Mutschke, 2017). Additionally, meta-analytic methods often face hurdles related to variability in data representation, which significantly limits the use of data extraction tools, and class imbalance in the development of classification tasks (Aumiller et al., 2020; Neppalli et al., 2016; Goldfarb-Tarrant et al., 2020).

Conclusions
The findings of the baseline review indicate that, when considering the process of automating systematic review and meta-analysis information extraction, social science research falls short as compared to clinical research that focuses on automatic processing of information related to the PICO framework (i.e., Population, Intervention, Control, and Outcome; Eriksen and Frandsen, 2018; Tsafnat et al., 2014). For example, while an LSR focusing on clinical research that is based on the PICO framework yielded 76 studies that included original data extraction (Schmidt et al., 2023), the present review of social science research yielded only 23 relevant studies. This is not necessarily surprising when considering the breadth of social science research and the lack of unifying frameworks and domain-specific ontologies (Göpfert et al., 2022; Wagner et al., 2022).
With a few exceptions, most tools we identified were either in the infancy stage and not accessible to applied researchers, were domain specific, or required substantial manual coding of articles before automation could occur. Additionally, few solutions considered extraction of data from tables, which is where many elements (e.g., effect sizes) that social and behavioral scientists analyze reside. Further, development appears to have ceased for many of the tools identified.
We found no evidence indicating hesitation on the part of social science researchers to adopt data extraction tools; on the contrary, abstractive text summarization approaches continue to gain traction across social science domains (Cairo et al., 2019; Feng et al., 2017). While these methods aid researchers in distilling complex information into meaningful insights, there remains a gap in technologies developed to augment human capabilities in the extraction of key data entities of interest for secondary data collection from quantitative empirical reports.
The impact of time-intensive research activities on translational value is not a new concern for the SLR research community. In many social sciences, emphasis is often placed on practical application and translational value, underscoring the importance of efficient research methodologies (Githens, 2015). Further development of the identified systems and techniques could mitigate time delays that often result in outdated information, as researchers cannot feasibly include all new evidence that may be released throughout the lifetime of a given project (Marshall & Wallace, 2019).

Limitations
As with any method that involves subjectivity, results can be influenced by a variety of factors (e.g., study design, publication bias, researcher judgment, etc.). We worked diligently to conduct this review and document our procedures in a systematic and transparent manner; however, efforts to replicate our search strategy or screening processes may not result in the same corpus or reach the same conclusions (Belur et al., 2018). This baseline review presented an opportunity to better develop our search and screening strategy, methodological procedures, and research goals. Moving forward, we have developed a codebook and assessment procedures to increase the transparency and reliability of our research.
A second limitation of this study was the omission of snowballing as a search strategy. Though we did not uncover applied secondary research articles utilizing automation tools, several potentially usable tools and systems were discovered in the course of this review. For future iterations of this LSR, we plan to incorporate forward snowballing from relevant articles in previous searches as part of our formalized search strategy (see Wohlin et al., 2022). Additionally, our search strategy has limitations related to its focus on English-language publications, the non-exhaustive list of databases and sources consulted, and the exclusion of grey literature. Addressing these aspects in future updates could enhance the comprehensiveness of findings and provide a broader perspective on the current state of automation tools in secondary research.

Reporting guidelines
This study follows PRISMA reporting guidelines (Page et al., 2021).
Open Peer Review

Reviewer Report

The reference from Yu et al. below is mostly concerning screening automation and not data extraction, if I am not missing a major point in the paper. If that is correct, then there may exist better works to reference in this context? "the process of extracting data from primary research is a labor-intensive effort, fraught with the potential for human error (see Pigott & Polanin, 2020; Yu et al., 2018)."

I am not an expert in social science research, but a few included references in Table 2 (e.g., … 2021, about the cotton industry) seem of unclear relevance to the social sciences?

Regarding this sentence in the conclusions, it might be more up-to-date to reference the review update from 2023 with 76 included papers: "For example, while an LSR focusing on clinical research that is based on the PICO framework yielded 53 studies that included original data extraction (Schmidt et al., 2021)".

One of the challenges with living updates is to adapt the search whenever there are new developments in a field of research. You may have already considered adapting the search strategy to make sure that new methods relying on large language models (LLM) like GPT or T5 are picked up? There may be relevant articles coming through soon; for example, https://arxiv.org/abs/2405.14445 may be of interest for a future review update as it looks at social science study data extraction, and if it is, then it would be good to make sure that the search can pick up the terminology correctly.

In the methodology section, could you please state the dates when the search relevant to the baseline review cutoff was conducted (for each data source, if different)?

Are the rationale for, and objectives of, the Systematic Review clearly stated? Yes

Are sufficient details of the methods and analysis provided to allow replication by others? Yes

Is the statistical analysis and its interpretation appropriate? Yes

Are the conclusions drawn adequately supported by the results presented in the review? Yes
If this is a Living Systematic Review, is the 'living' method appropriate and is the search schedule clearly defined and justified? ('Living Systematic Review' or a variation of this term should be included in the title.) Yes

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Systematic review automation, automated data extraction (clinical trials), natural language processing, living reviews

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Author Response 22 Sep 2024

Amanda Legate
We are honored that you agreed to review our research and sincerely appreciate your thoughtful review and feedback. Please find responses to each comment below.

Comment 1
The reference from Yu et al. below is mostly concerning screening automation and not data extraction, if I am not missing a major point in the paper. If that is correct, then there may exist better works to reference in this context? "the process of extracting data from primary research is a labor-intensive effort, fraught with the potential for human error (see Pigott & Polanin, 2020; Yu et al., 2018)."

Comment 2: Response Thank you for your insights on the relevance of references in Table 2. Our search strategy was intentionally broad to include studies utilizing (semi)automated data extraction methods across various domains, provided they were not solely focused on clinical research. The goal was to ensure comprehensiveness; however, we understand your concern regarding the ambiguity of some references' relevance to social sciences. As our study is a "living" review, we see this as an excellent opportunity to consider refining our inclusion criteria in future updates. We will explore more targeted approaches that can help streamline the search strategy, potentially focusing on research that more directly applies to social sciences or explicitly demonstrates transferable methodologies that align with the needs of social science researchers. Additionally, we are discussing options for collaborating with experts who specialize in bibliometric analysis or search strategy optimization to ensure that our review remains focused, relevant, and complementary to your work.
Comment 3 Regarding this sentence in the conclusions, it might be more up-to-date to reference the review update from 2023 with 76 included papers: "For example, while an LSR focusing on clinical research that is based on the PICO framework yielded 53 studies that included original data extraction (Schmidt et al., 2021)".
Comment 3: Response Thank you for pointing out this important update; we appreciate your diligence in ensuring that references are current and reflective of the most recent findings. The manuscript has been revised to reference the 2023 update of your LSR to accurately reflect the most up-to-date results.
Comment 4 One of the challenges with living updates is to adapt the search whenever there are new developments in a field of research. You may have already considered adapting the search strategy to make sure that new methods relying on large language models (LLM) like GPT or T5 are picked up? There may be relevant articles coming through soon; for example, https://arxiv.org/abs/2405.14445 may be of interest for a future review update as it looks at social science study data extraction, and if it is, then it would be good to make sure that the search can pick up the terminology correctly.
Comment 4: Response Thank you for highlighting the importance of adapting the search strategy to capture emerging developments in automation technologies, particularly those involving large language models (LLMs) like GPT or T5. We completely agree that a key aspect of maintaining the relevance and rigor of a living systematic review is to continuously update the search strategy to reflect the current state-of-the-art in the field. We will incorporate this valuable feedback into future iterations by updating our search terms and strategies to include LLM-related methodologies and terminologies, ensuring the inclusion of new and relevant articles. The paper you referenced (https://arxiv.org/abs/2405.14445) serves as an excellent example, and we will use it to refine our search criteria. This approach will help us stay current with advances in data extraction techniques. Thank you for providing specific references to guide this adaptation.
Comment 5 In the methodology section, could you please state the dates when the search relevant to the baseline review cutoff was conducted (for each data source, if different)?

Comment 5: Response Thank you for this thoughtful suggestion. While we reported the search dates in the extended data files housed in OSF, we agree with you that including them directly in the Methods section would add clarity and value for readers. We have updated the section to specify the dates when searches were conducted for each data source, ensuring this information is clear and accessible to readers.
Competing Interests: No competing interests were disclosed.

Reviewer Report

Title:
1. Since this review does not include any qualitative or quantitative synthesis per se, but rather provides an overview of the field (methods for semi-automated data extraction), I suggest removing "living systematic review" and adding "living systematic map."

Abstract:
1. The summary of methods could include more detailed information on searches, screening, critical appraisal, and synthesis. Please specify which standards for review conduct were followed.
2. The summary of results could provide more information (briefly) about the included studies.

Keywords:
1. Avoid repeating terms already present in the title.

Introduction:
1. The focus of this review (extraction tools for quantitative data) should be more explicitly stated. This emphasis needs to be clearer in the introduction and reflected in the title, as mentioned earlier. Specifically, the first paragraph of the Introduction should be revised to concentrate on the review topic (quantitative data extraction and existing tools) rather than a general introduction to meta-science or related areas.
2. Additional details are needed on how this review contributes to and complements existing reviews on the topic. This information should be included in the "Related Research" section.

Objectives:
1. It would be helpful to define what is included under "social science research domains".

Methods:
1. Authors should be transparent and explicit about the guidelines and standards for both conduct and reporting that were used. Please clarify this at the beginning of the Methods section.
2. The methods section should begin by addressing any deviations from the protocol. If there were no deviations, this should be clearly stated as well.
3. Did you use any automation technologies to screen or select studies for this review? If yes, please clarify.

Methods/Eligibility criteria:
1. The eligibility criteria should be explicit about the field within which methods for (semi)automated data extraction are applied.
2. A definition of "(semi)automated" is needed. The eligibility criteria currently state that semi-automated approaches will be eligible but then refer to "any automated approach to data extraction" in the next sentence. This needs to be clarified: are the focus and criteria on semi-automated or automated approaches? Be more explicit and precise in the description of the eligibility criteria, and ensure alignment with the protocol.
3. Instead of "We excluded studies labeled as editorials, briefs..." you may write "Editorials, briefs, … were not considered eligible" (and similar changes may be applied to the following sentence).

Methods/Searches:
1. Be explicit about the citation indices included in your Web of Science subscription and note which library was used to access WoS. This will increase transparency and replicability of your searches.
2. Clarify why following Schmidt et al.'s search strategy was important, given the different scope of this review. Consider including more social science databases to ensure comprehensive coverage. Did you include the Social Science Citation Index (within WoS)?
3. Provide explanations for all abbreviations (IEEE, ACL, etc.) in the text.

Methods/Study selection:
1. Clarify if three researchers simultaneously screened titles and abstracts (TA), and whether inter-rater reliability (IRR) was calculated for TA screening. How did you train reviewers to apply eligibility criteria?
2. The sentence, "In cases where level of abstraction and potential for transferability could not be determined from the abstract alone, full text articles were reviewed and discussed by all three researchers until consensus was reached", should more clearly state that there was NO full-text screening of all records (if this is correct), only of a sub-sample where abstracts did not clearly describe AI technology, etc.
3. Relatedly, Figure 1 should be adjusted to avoid giving the false impression that all records were screened in full text.
4. This review seems to involve META-data extraction rather than DATA extraction. Please adjust the text and figures accordingly.
5. It is not clear if IRR assessments were conducted for meta-data extraction. Please clarify/be explicit. If IRR was not done, describe how researchers were trained to use the extraction form.
6. The sentence, "coding forms allowed for input of "other" responses (e.g., APA data elements) that were not included in extant reviews that focus on medical and clinical data extraction (e.g., PICO elements)" is unclear. Consider removing or clarifying and linking it better with the rest of the text.
7. Describe the procedure for screening and meta-data extraction of studies authored by the review team.

Methods/Critical appraisal and Synthesis:
1. These sections are missing. Please state clearly if a critical appraisal of included studies was conducted and, if so, how it was performed. Also, describe how synthesis was conducted.

Results/Challenges:
1. Clarify that the described challenges reflect issues within the body of evidence included in this (baseline) review (otherwise this section can be mixed up with review limitations).

Conclusions/Limitations:
1. Organize limitations into those related to the methodology used and those related to the evidence base.
2. Discuss limitations related to the focus on publications in English, the inexhaustive list of search sources, and the lack of grey literature.

Author Response

We would like to express our genuine appreciation for the important work you and your colleagues have done in developing the ROSES (Reporting standards for Systematic Evidence Syntheses) guidelines for systematic evidence synthesis in environmental science. Improving transparency and standardization in research reporting is a goal we fully support. While we acknowledge the value of the ROSES guidelines, they were not the reporting standard required or appropriate for our systematic review. We noticed that many of your comments seem to assess our manuscript against the ROSES guidelines (Haddaway et al., 2017a; 2017b; 2018; Haddaway & Macura, 2018). For example, the suggestion to emphasize "meta-data extraction" aligns more with ROSES, whereas PRISMA does not require such differentiation and focuses on clarity in describing the data collection process, whether it involves meta-data or primary data points.

We believe it is essential to assess our work based on the scope and framework provided by PRISMA rather than extend it beyond its current focus to fit an alternative reporting framework. Further, suggestions which are biased towards a reviewer's own work conflict with the expectations of fair peer review. We are committed to making revisions that enhance the clarity and rigor of our research while remaining consistent with the standards required by the journal.
Thank you again for your constructive feedback and for considering our clarifications. We look forward to moving forward in a way that upholds the standards of rigorous and unbiased scholarly review.
Comment 1 & 2: Response Thank you for these valuable suggestions regarding the title. We have considered (1) specifying the type of data being extracted in the title and (2) changing the title from "living systematic review" to "living systematic map." However, we have retained the original title to ensure consistency with our pre-registered protocol, adhere to PRISMA reporting standards, and comply with F1000Research guidelines.
Comment 3 [Abstract] The summary of methods could include more detailed information on searches, screening, critical appraisal, and synthesis. Please specify which standards for review conduct were followed.
Comment 3: Response Thank you for the suggestion to provide more detailed information on searches, screening, critical appraisal, and synthesis in the abstract to better align with ROSES reporting recommendations. To ensure compliance with the journal's requirements, we followed the PRISMA guidelines for a structured summary, which emphasize conciseness in presenting objectives, eligibility criteria, methods, results, and conclusions. While we understand the desire for additional details, we believe the current abstract aligns with these guidelines but will review it again to ensure optimal clarity.
Comment 4 [Abstract] The summary of results could provide more information (briefly) about the included studies.
Comment 4: Response Thank you for the recommendation to provide more information about the included studies within the abstract. We will incorporate a brief summary of the included studies' key characteristics and findings in future updates to this review to enhance clarity and completeness.
Comment 5 [Keywords] Avoid repeating terms already present in the title.

Comment 5: Response Thank you for highlighting ROSES guidance indicating that keywords should not repeat the title but rather provide additional context. Where appropriate, we will revise keywords to avoid redundancy and enhance discoverability.
Comment 6 [Introduction] The focus of this review (extraction tools for quantitative data) should be more explicitly stated. This emphasis needs to be clearer in the introduction and reflected in the title, as mentioned earlier. Specifically, the first paragraph of the Introduction should be revised to concentrate on the review topic (quantitative data extraction and existing tools) rather than a general introduction to meta-science or related areas.
Comment 6: Response Thank you for your feedback on clarifying the focus of our review. Our study does not exclusively focus on extraction tools for quantitative data; it encompasses approaches to data extraction for both quantitative and qualitative data elements relevant to evidence synthesis in systematic reviews and meta-analyses within the social sciences. To better reflect this broader focus, we have revised the objective section to explicitly state that the review covers data extraction tools for a range of data types. We hope this adjustment will provide clearer insight into the comprehensive scope of our review.
Comment 7 [Introduction] Additional details are needed on how this review contributes to and complements existing reviews on the topic. This information should be included in the "Related Research" section.
Comment 7: Response Thank you for this insightful comment. We agree on the importance of clearly situating our review within the existing literature to highlight its unique contributions. Although we did not adhere to ROSES guidelines for explaining the review's relevance to existing literature, we followed PRISMA guidelines in the "Related Literature" section to identify relevant prior reviews and synthesize their focus, findings, and limitations. We will consider ways to enhance this section to better emphasize our review's distinct contributions moving forward.
Comment 8 [Objectives] It would be helpful to define what is included under "social science research domains".
Comment 8: Response Thank you for this suggestion. Our pre-registered research protocol and the "Baseline Review Search Strategy" document (available in the project's OSF repository) provide a comprehensive list of over 100 subject categories included under social science research domains, ranging from sociology and political science to interdisciplinary areas such as "Social Sciences Mathematical Methods." To enhance clarity, we have updated the objectives section to include more details and a reference to the extended data file.
Comment 9 [Methods] Authors should be transparent and explicit about the guidelines and standards for both conduct and reporting that were used. Please clarify this at the beginning of the Methods section.
Comment 9: Response We acknowledge that the ROSES guidelines recommend transparency in reporting the guidelines and standards for both conduct and reporting at the beginning of the Methods section. However, in accordance with the journal's article guidelines for living systematic reviews (available from: https://f1000research.com/for-authors/article-guidelines/living-systematic-reviews), this information is provided in the "Reporting Guidelines" section.
Comment 10 [Methods] The methods section should begin by addressing any deviations from the protocol. If there were no deviations, this should be clearly stated as well.
Comment 10: Response Thank you for highlighting this important aspect. While we did not adopt the ROSES reporting standards for this research, we recognize their guidance on stating any deviations from the protocol at the beginning of the methods section. We have addressed any deviations in the appropriate sections of the paper, and additional descriptions are provided in the extended data files to ensure transparency and replicability.
Comment 11 [Methods] Did you use any automation technologies to screen or select studies for this review? If yes, please clarify.
Comment 11: Response Thank you for your question. The use of automation technologies is detailed in the "Search Sources" and "Study Selection" subsections of the Methods section. Additionally, to ensure transparency and replicability, further details are provided in the "Software Availability" section, as per F1000Research guidelines.
Comment 12 [Methods/Eligibility criteria] The eligibility criteria should be explicit about the field within which methods for (semi)automated data extraction are applied.
Comment 12: Response Thank you for this comment. To ensure clarity, we have referenced the extended data files in the text, which provide comprehensive details and a full list of over 100 research fields. These details are openly available in the project repository, as specified in the protocol (please see response to Comment 8).
Comment 13 [Methods/Eligibility criteria] A definition of "(semi)automated" is needed. The eligibility criteria currently state that semi-automated approaches will be eligible but then refer to "any automated approach to data extraction" in the next sentence. This needs to be clarified: are the focus and criteria on semi-automated or automated approaches? Be more explicit and precise in the description of the eligibility criteria, and ensure alignment with the protocol.
Comment 13: Response Thank you for your suggestion to clarify the phrasing regarding the eligibility criteria. We have revised the description to specify that the focus is on any "technique" applied for extracting data from literature in a semi-automated manner. This adjustment aligns with the study protocol.
Comment 14 [Methods/Eligibility criteria] Instead of "We excluded studies labeled as editorials, briefs.." you may write "Editorials, briefs, …were not considered eligible" (and similar changes may be applied to the following sentence).
Comment 14: Response Thank you for the suggestion. We have revised the text to use passive construction, as recommended. We have also applied similar changes to the following sentence for consistency.
Comment 15 [Methods/ Searches] Be explicit about the citation indices included in your Web of Science subscription and note which library was used to access WoS. This will increase transparency and replicability of your searches.
Comment 15: Response Thank you for this suggestion. To avoid redundancy in the manuscript, we have added a statement directing readers to the extended data files, which provide additional detail related to WoS indices and search settings.
Comment 16 [Methods/ Searches] Clarify why following Schmidt et al.'s search strategy was important, given the different scope of this review.
Comment 16: Response Thank you for this comment. To clarify, we followed Schmidt et al.'s search strategy to ensure comprehensive coverage of relevant databases and consistency in methodological rigor, which is important even with a different scope. To avoid redundancy, we have added a reference in the manuscript directing readers to the extended data file and research protocol in our open-access repository, where this rationale is explained in detail.
Comment 17 [Methods/Searches] Consider including more social science databases to ensure comprehensive coverage.
Comment 17: Response Thank you for this valuable suggestion. We appreciate the importance of comprehensive coverage and will consider including additional social science databases in future updates to further enhance the scope of our review.
Comment 18 [Methods/Searches] Did you include the Social Science Citation Index (within WoS)?
Comment 18: Response Yes, the Social Science Citation Index within Web of Science was included. We have updated the text to clarify that all editions, settings, and search syntax used are detailed in the extended data files available in the open-access repository.
Comment 19 [Methods/Searches] Provide explanations for all abbreviations (IEEE, ACL, etc.) in the text.
Comment 19: Response Thank you for the suggestion. We have added explanations for all source abbreviations (e.g., IEEE, ACL) in the text to improve clarity for readers.
Comment 20 [Methods/Study selection] Clarify if three researchers simultaneously screened titles and abstracts (TA), and whether inter-rater reliability (IRR) was calculated for TA screening. How did you train reviewers to apply the eligibility criteria?
Comment 20: Response Thank you for your question. The "Study Selection" section of the paper details the independent screening procedures, the training process for reviewers on applying eligibility criteria, and inter-rater reliability (IRR) considerations.
Comment 21 [Methods/Study selection] The sentence, "In cases where level of abstraction and potential for transferability could not be determined from the abstract alone, full text articles were reviewed and discussed by all three researchers until consensus was reached", should more clearly state that there was NO full-text screening of all records (if this is correct), only of a sub-sample where abstracts did not clearly describe AI technology, etc.
Comment 26 [Methods/Study selection] Describe the procedure for screening and meta-data extraction of studies authored by the review team.
Comment 26: Response Thank you for highlighting ROSES guidance surrounding procedures for handling studies authored by the review team. However, no alternative procedures were implemented; therefore, there are no additional procedures to report.
Comment 27 [Methods/Critical appraisal and Synthesis] These sections are missing. Please state clearly if a critical appraisal of included studies was conducted and, if so, how it was performed. Also, describe how synthesis was conducted.
Comment 27: Response Thank you for your comment. These sections are specific to ROSES guidelines and are not required by PRISMA or the journal's reporting standards.
Comment 28 [Results/Challenges] Clarify that the described challenges reflect issues within the body of evidence included in this (baseline) review (otherwise this section can be mixed up with review limitations).
Comment 28: Response To avoid confusion with review limitations, we have revised the first sentence of this section to clarify that the challenges discussed specifically reflect issues within the body of evidence included in this baseline review.
Comment 29 [Conclusions/Limitations] Organize limitations into those related to the methodology used and those related to the evidence base.
Comment 29: Response While our research and protocol were developed following PRISMA guidelines rather than ROSES (which requires a structured discussion of limitations), we appreciate the value of differentiating between methodological constraints and evidence-base gaps. We will consider this distinction in future updates to enhance clarity.
Comment 30 [Conclusions/Limitations] Discuss limitations related to the focus on publications in English, the inexhaustive list of search sources, and the lack of grey literature.
Comment 30: Response Thank you for the suggestion. We have updated the limitations section to address the focus on publications in English, the inexhaustive list of search sources, and the lack of grey literature.
If this is a Living Systematic Review, is the 'living' method appropriate and is the search schedule clearly defined and justified? ('Living Systematic Review' or a variation of this term should be included in the title.) Yes
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 1. LSR workflow. This image is reproduced under the terms of a Creative Commons Attribution 4.0 International license (CC-BY 4.0) from Legate and Nimon (2023b). Note. Arrows represent stages involved in a static systematic review; the dotted line (from "Publish Report" to "Search") represents the stage at which the review process is repeated from the beginning while the review remains in living status.

Figure 4. Included publications. Note. Presented Tool = described/demonstrated a software tool, system, or application for data extraction (n=12); Developed Method = developed techniques and/or methods for automated data extraction (n=9); Evaluated Techniques = tested or evaluated the performance of existing tools, techniques, or methods (n=2); Applied Tool = applied automation tools to conduct secondary research (n=0).
(see Jurafsky & Martin, 2024, Chapter 6). Of these, GloVe was used in four studies (Chen et al., 2021; Goldfarb-Tarrant et al., 2020; Nowak & Kunstman, 2019; Anisienia et al., 2021) and ELMo in two (Nowak & Kunstman, 2019; Anisienia et al., 2021). The most common frequency-based feature representation approaches were Bag-of-Words (BoW, n=5) and Term Frequency-Inverse Document Frequency (TF-IDF, n=4). Although less frequently applied in the corpus, methods for representing words or documents as vectors based on semantic properties, such as Vector Space Models (VSM) and sentence embeddings, were used as early as 2007. Other less commonly reported methods included synonym aggregation/expansion, best match ranking (BM25), shingling, and subject-verb pairings.
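For readers less familiar with frequency-based representations, the following minimal sketch contrasts BoW and TF-IDF using scikit-learn; the two example abstracts are invented, and the snippet is illustrative rather than drawn from any included study.

```python
# Minimal sketch (illustrative only): Bag-of-Words vs. TF-IDF
# representations of two made-up abstracts.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

abstracts = [
    "We meta-analyzed effect sizes from 40 organizational studies.",
    "Sample sizes and effect sizes were extracted from each study.",
]

# Bag-of-Words: raw token counts per document
bow = CountVectorizer()
bow_matrix = bow.fit_transform(abstracts)

# TF-IDF: counts reweighted to down-weight terms common across documents
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(abstracts)

print(bow.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```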
types of sentences (e.g., objective, results, conclusions) and by Goldfarb-Tarrant et al. (2020) for splitting papers into specific sections (e.g., abstract, introduction, methods). Alternatively, Pertsas and Constantopoulos (2018) used RegEx to exploit lexico-syntactic patterns derived from an ontology knowledge base (Activities, Goals, and Propositions). Other RegEx uses included modifying datasets to incorporate patterns related to citation mentions (Anisienia et al., 2021) or applying rule-based chunking and processing to identify and extract relevant chunks from text (Nayak et al., 2021). The remaining six studies described custom rule-based algorithms or other heuristic approaches. Li et al. (2022), for example, applied the rule-based algorithms PrefixSpan and Gap-BIDE for the extraction of frequent discourse sequences. RAKE (Rapid Automatic Keyword Extraction) was applied by Bayatmakou et al. (2022) to extract keywords which served as representations of a document's content. And Aliyu et al. (2018) described a rule-based algorithm developed for processing full-text documents to identify and extract section headings.
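As a hedged illustration of the general idea (not a reproduction of any included study's rule set, such as Aliyu et al.'s), a RegEx pattern for pulling numbered section headings from plain text might look like this:

```python
import re

# Hypothetical plain-text article; the heading pattern below is an
# illustration, not the rules described in the included studies.
text = """1. Introduction
Prior work ...
2. Methods
We sampled ...
3. Results
Mean effects ..."""

# Match numbered headings such as "2. Methods" at the start of a line
heading = re.compile(r"^\d+\.\s+([A-Z][A-Za-z ]+)$", re.MULTILINE)
print(heading.findall(text))  # ['Introduction', 'Methods', 'Results']
```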
An outcome we did not anticipate was the substantial number of open source tools, toolkits, and frameworks utilized by our relatively small corpus of articles. Because we were unsure what to expect, we made every effort to capture evidence that might prove useful to social science researchers. We identified 50 different open source technologies including platforms, software, software suites, packages/libraries, algorithms, pre-trained models, controlled vocabularies/thesauri, lexical databases, knowledge representations, and more. Open source tools identified are reported in Figure 10. Of the open source resources available to researchers, the overwhelming majority were Python tools (n=16; see Python Package Index, https://pypi.org/), and 8 of 23 (35%) studies used the Python Natural Language Toolkit (NLTK). The full list of open source tools and license details are available in the OSF repository (see 'Underlying Data' section).
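To give a concrete sense of why NLTK recurs across these pipelines, here is a minimal preprocessing sketch (tokenization and stop-word removal); the sentence is invented, and the snippet does not reproduce any included study's code.

```python
# Minimal NLTK preprocessing sketch: tokenize and drop stop words.
import nltk
nltk.download("punkt", quiet=True)       # tokenizer models
nltk.download("punkt_tab", quiet=True)   # required by newer NLTK releases
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = "The total sample consisted of 215 undergraduate students."
tokens = word_tokenize(sentence.lower())
stops = set(stopwords.words("english"))
content_tokens = [t for t in tokens if t.isalnum() and t not in stops]
print(content_tokens)
# ['total', 'sample', 'consisted', '215', 'undergraduate', 'students']
```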

Reviewer Report 19 August 2024
https://doi.org/10.5256/f1000research.166140.r298402
© 2024 Macura B. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Biljana Macura, Stockholm Environment Institute, Stockholm, Sweden
This manuscript represents an important contribution to the evidence synthesis methodology. Given the rise of AI technology, a living evidence base on approaches to data extraction will be very useful. However, the manuscript could benefit from improved clarity. Below are my comments:
1. Title: Clarify the type of data being extracted (qualitative, quantitative, or mixed).
Note. AI = Artificial Intelligence, ML = Machine Learning, SLR = Systematic Literature Review, NLP = Natural Language Processing, NR = Not Reported.
, excluding medical research sources. Because IEEE content is indexed in Web of Science (Young, 2023), we did not include IEEE Xplore as a separate source. We added two additional databases (ACL and ArXiv) and conducted a search for data extraction tools in the Systematic Review Toolbox (Marshall et al., 2022) to capture associated articles. Searches were conducted in Web of Science; search and deduplication followed procedures stated in the protocol (Legate & Nimon, 2023b). We adapted source code developed by Schmidt et al. (2021) for automating search, retrieval, and deduplication functions on full database dumps for the ACL, ArXiv, and DBLP platforms. Complete details, including citation indices and specific settings applied, search syntax, and adapted source code, are available in the project repository (see 'Data availability' section).
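For illustration only, the normalize-and-compare logic underlying record deduplication might be sketched as follows; this is an assumption-laden toy, not the Schmidt et al. (2021) code we adapted.

```python
# Toy deduplication sketch: collapse records whose titles differ only
# in casing and punctuation. Real pipelines also compare DOIs, years,
# and author lists; the records below are fabricated.
import re

def norm_title(title: str) -> str:
    """Lowercase and strip non-alphanumerics so formatting differences collapse."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

records = [
    {"doi": "10.1000/xyz123", "title": "Automated Data Extraction: A Review"},
    {"doi": "",               "title": "Automated data extraction -- a review"},
]

seen, unique = set(), []
for rec in records:
    key = norm_title(rec["title"])
    if key not in seen:
        seen.add(key)
        unique.append(rec)

print(len(unique))  # 1 -- the two title variants collapse to one record
```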

Table 2. (Continued.)
One study developed a Sentence Context Ontology (SENTCON) for modeling the contexts of information extracted from research documents. Piroi et al. (2015) developed and presented an annotation system for populating ontologies in domains lacking adequate dictionaries. Some work focused on automatically mapping structures of research documents. For example, using an open source lexical database to develop a canonical model of structure, Aliyu et al. (2018) were able to automatically identify and extract target paper sections from research documents. Shahid and Afzal (2018) utilized specialized ontologies to automatically tag content in research papers by logical sections. Chen et al. (2020) presented a novel framework for text summarization, including ontology-based topic identification and user-focused summarization modules.
Specifically, BERT (Aumiller et al., 2020; Shen et al., 2022) and SciBERT (Goldfarb-Tarrant et al., 2020; Li et al., 2022) were the most utilized for tasks relevant to extracting data from research in social sciences. Other language models included BioBERT (Chen et al., 2020) and distilBERT (Goldfarb-Tarrant et al., 2020). We identified a recent application of the Hugging Face LED model (Li et al., 2022), a pretrained Longformer model developed to address length limitations associated with other transformer-based approaches (see Beltagy et al., 2020).
Six of the included studies applied Named Entity Recognition (NER) techniques. Increasing availability of tools to support the entire SLR pipeline, including data extraction efforts, may be partially to credit for upward trends in NER applications. Based on the applications we identified, NER would best be described as versatile. Some studies incorporated NER as an integral component embedded throughout a larger ML/NLP pipeline (e.g., Goldfarb-Tarrant et al., 2020), others included NER subcomponents leveraged primarily for preprocessing and feature representation tasks (e.g., Pertsas & Constantopoulos, 2018), and in one study, the authors took advantage of open source NER tools that could be easily integrated into a highly modifiable artifact serving as a platform for future development of holistic approaches to scaling SLR tasks (Denzler et al., 2021).
Extractive question-answering models involve tasks where a model generates answers to questions based on a given context. Question-answering models appeared in our dataset as early as 2007 (Liu et al., 2007), with the remaining applications published in 2020 or later. Question-answering techniques have a range of applications that most readers are likely familiar with, like chatbots and intelligent assistants (e.g., Alexa, Google Assistant, Siri). However, state-of-the-art approaches for question answering over knowledge bases are also being put to use in the data extraction arena. The study by Bayatmakou et al. (2022), for example, introduced new methods for interactive multi-document text summarization that allow users to specify summary compositions and interactively refine queries after reviewing complete sentences automatically extracted from documents.
Probabilistic Models. Among probabilistic models, Conditional Random Field (CRF) applications were predominant in our dataset. CRF was often applied for sequence labeling tasks, such as named entity recognition (e.g., Nayak et al., 2021), or for classification tasks (e.g., Angrosh et al., 2014).
Overall, included studies provided evidence that CRF can form a powerful architecture when combined with RNNs (e.g., bi-GRU-CRF, bi-LSTM-CRF; see Nowak & Kunstman, 2019; Shen et al., 2022). We found a single application of the Maximum Entropy Markov Model (MEMM); however, based on experimental results, the authors ultimately selected CRF for identifying sentence context for extraction from research publications (Angrosh et al., 2014).
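To make the CRF sequence-labeling idea concrete, the toy sketch below uses the open source sklearn-crfsuite package; the features, tags, and sentences are invented and far simpler than those used in the included studies.

```python
# Toy CRF sequence-labeling sketch with sklearn-crfsuite: tag tokens
# that carry sample-size information (B-N) versus everything else (O).
import sklearn_crfsuite

def featurize(tokens):
    # Per-token feature dicts; real systems use far richer features.
    return [{"lower": t.lower(), "is_digit": t.isdigit(),
             "is_title": t.istitle()} for t in tokens]

train_sents = [["Sample", "size", "was", "215", "students"],
               ["We", "recruited", "98", "teachers"]]
train_tags  = [["O", "O", "O", "B-N", "O"],
               ["O", "O", "B-N", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([featurize(s) for s in train_sents], train_tags)

test = ["The", "study", "sampled", "42", "nurses"]
print(crf.predict([featurize(test)])[0])  # e.g., ['O', 'O', 'O', 'B-N', 'O']
```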
Transformer-based approaches have experienced rapid growth since 2020 (e.g., Goldfarb-Tarrant et al., 2020; Shen et al., 2022). Bidirectional Encoder Representations from Transformers (BERT) and other BERT-based language models made up the majority of transformer-based approaches.
Figure 7. Model architectures and components.
Classifiers. For classification approaches, we followed Schmidt et al. (2021) in reporting instances of Support Vector Machines (SVM) separately from other binary classifiers and likewise found a high prevalence of SVM usage, accounting for 50% of all binary classifiers identified (Goldfarb-Tarrant et al., 2020; Shahid & Afzal, 2018; Shen et al., 2022; Zielinski & Mutschke, 2017). Among classifiers that use a linear combination of inputs (Jurafsky & Martin, 2024), naïve Bayes was the most frequent (Neppalli et al., 2016; Shahid & Afzal, 2018; Torres et al., 2012; Zielinski & Mutschke, 2017). One study used a Perceptron classifier; however, it was extended (i.e., OvR) to handle multiclass problems (Aumiller et al., 2020). Multi-class classifiers were less common, with one instance each of k-Nearest Neighbors (aka KNN/kLog; Zielinski & Mutschke, 2017) and the J48 classifier (C4.5 Decision Trees; Piroi et al., 2015).
Neural Networks. Overall, there were a variety of neural network applications across the included studies. Most used Long Short-Term Memory (LSTM), more specifically, Bidirectional LSTM (BiLSTM). We also identified one application of a Bidirectional Gated Recurrent Unit (BiGRU; Shen et al., 2022). Convolutional Neural Network (CNN) architectures (Goldfarb-Tarrant et al., 2020; Nowak & Kunstman, 2019; Anisienia et al., 2021) were also present. Several studies evaluated state-of-the-art deep learning methods. For example, Shen et al. (2022) compared the performance of deep learning models (TextCNN and BERT) for sentence classification in social science abstracts. In another comparative study, Anisienia et al. (2021) compared methods for pretraining deep contextualized word representations for cutting-edge transfer learning techniques based on CNN and LSTM architectures in addition to classifier models (e.g., SVM).
Versatile and widely applicable, rule- and heuristic-based approaches offer a robust framework for automating data extraction or for capturing relevant information from large volumes of text. See Figure 8 for rule-based approaches reported across included studies. Overall, 70% (n=16) of included studies utilized rule- or heuristic-based approaches to support a variety of tasks for data extraction. Of these, nearly half (n=7) reported using Regular Expressions (RegEx). For example, based on rules developed from manual inspection, RegEx was used by Torres et al. (2012) to construct patterns for identifying specific types of sentences.
Five studies provided descriptions of user feedback and other ratings. User feedback (among other metrics) was reported by Li et al. (2022), who conducted expert human comparative assessment to assess fluency, relevance, coherence, and overall quality of model citation span/sentence generation outputs. This category also included evaluation metrics not listed in the sources we adapted when developing our protocol (see O'Mara-Eaves et al., 2015, p. 3, Table 1; Schmidt et al., 2021, pp. 8-9). For example, in assessing their system on values returned for queries of interest, Nayak et al. (2021) reported suitability, adaptability, relevance scores, and data dependencies.
As another example, Denzler et al. (2021, p. 5) evaluated their artifact based on design science aspects (i.e., validity, efficacy, and utility). Given the rapid growth of domain-specific ontologies and pre-trained language models, it is not surprising to find Kappa statistics reported for tasks such as evaluating agreement between human annotators when creating gold standard datasets for training and evaluation (Cohen's Kappa, see Pertsas & Constantopoulos, 2018; Mezzich's Kappa or Gwet's AC1, see Anisienia et al., 2021). Semantic similarity scores, which can be used to compare model-generated responses against ground-truth responses in query-based or question-answering based applications, were reported in two studies (Jaccard Index, Bayatmakou et al., 2022; DKPro Similarity, Zielinski & Mutschke, 2017). We also found one application of leave-one-out cross-validation (LOOCV; Piroi et al., 2015) and one application of document-level CV used as a supplemental technique to k-fold (Neppalli et al., 2016).
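For orientation, two of the metric families above can be computed in a few lines; the annotation labels and token sets below are fabricated for illustration only.

```python
# Minimal sketch: Cohen's kappa for inter-annotator agreement and the
# Jaccard index for comparing generated vs. reference token sets.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["method", "result", "method", "other", "result"]
annotator_b = ["method", "result", "other", "other", "result"]
print(cohen_kappa_score(annotator_a, annotator_b))  # agreement beyond chance

def jaccard(a: set, b: set) -> float:
    # |intersection| / |union|
    return len(a & b) / len(a | b)

generated = {"sample", "size", "was", "215"}
reference = {"the", "sample", "size", "was", "215"}
print(jaccard(generated, reference))  # 0.8
```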
Reader (Chen et al., 2021) was available to users through an R Shiny application. SysRev (Bozada et al., 2021) was also the only tool cataloged in the SR Toolbox (Marshall et al., 2022). Six of the twenty-three studies (26%) made source code openly available (Chen et al., 2021; Denzler et al., 2021; Diaz-Elsayed & Zhang, 2020; Goldfarb-Tarrant et al., 2020; Iwatsuki et al., 2017; Li et al., 2022). Article references and corresponding repositories are detailed in Table 3. GitHub stood out as the most popular repository for code and data sharing, and one study made source code available online through an open access publisher.
Transferability
In the evolving landscape of systematic reviews and meta-analyses, the adaptability of tools and technologies to new research domains emerged as a critical factor for enhancing research efficiency and scope. The insights provided by many of the authors working towards automation of data extraction illuminate the transferability of various tools and technologies for research targeting the extraction of data elements beyond PICO.

Table 3. Code repositories.
… extraction of data across a range of research domains, including education, management, and health informatics. Chen et al. (2020) highlighted the adaptability of OATS, showcasing its broader application potential to fields beyond the authors' COVID-19-specific demonstration. Finally, Goldfarb-Tarrant et al. (2020) …

Table 4.
… as outlined by JARS (Appelbaum et al., 2018, p. 6). Each tool was assessed for potential to extract specific data elements by manuscript section (i.e., methods and results reporting elements pertinent to meta-analytic research; see Legate & Nimon, 2023b). Where the authors did not state a tool name, we used the description of the tool as presented in the paper (e.g., Bayatmakou et al., 2022; Nayak et al., 2021). Unlike ongoing research that focuses on data extraction from clinical literature (e.g., PICO elements/RCTs; see Schmidt et al., 2023), specific reporting guidelines were not a primary focus of the studies we identified. However, authors described target entities and/or research methods of interest with high levels of specificity, for instance, extracting descriptive statistics, sample size, and Likert scale points (Neppalli et al., 2016) and extracting research hypotheses from published literature in the organizational sciences (Chen et al., 2021). Despite the lack of discourse surrounding specific reporting guidelines, many of the tools reviewed incorporated some form of user-prompted, annotation- or query-based approach to (semi)automated data extraction. Thus, the collective body of work lends optimism surrounding customizable state-of-the-art methods that can support extraction for a wide range of disciplines, research designs, and entities or data elements of interest to social science researchers.
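As a purely hypothetical sketch of how such high-specificity targets (sample size, descriptive statistics, Likert scale points) might be matched with patterns, consider the following; none of the reviewed tools reduces to this, and real systems pair rules like these with the ML/NLP components described above.

```python
# Hypothetical pattern-matching sketch for APA-style reporting elements;
# the sentence and patterns are invented for illustration.
import re

sentence = ("The final sample (N = 215) yielded M = 3.42, SD = 0.87 "
            "on a 5-point Likert scale.")

patterns = {
    "sample_size": r"N\s*=\s*(\d+)",
    "mean":        r"M\s*=\s*([\d.]+)",
    "sd":          r"SD\s*=\s*([\d.]+)",
    "likert":      r"(\d+)-point Likert",
}

extracted = {name: re.search(pat, sentence).group(1)
             for name, pat in patterns.items()
             if re.search(pat, sentence)}
print(extracted)
# {'sample_size': '215', 'mean': '3.42', 'sd': '0.87', 'likert': '5'}
```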
One flexible approach is extractive question answering based on pre-trained Transformer models. Extractive question-answering models are able to generate direct answers from a knowledge base in response to natural language questions posed by users (Kwiatkowski et al., 2019). These tools typically offer enhanced flexibility through user-defined prompts and mechanisms for interactive query refinement. Example tools that incorporated question-answering techniques included CIRRA (Piroi et al., 2015), the Interactive Text Summarization System for Scientific Documents (Bayatmakou et al., 2022), and OATS (Chen et al., 2021). Other types of flexible systems allow users to view excerpts related to specific keywords or queries, supporting expedited identification and labeling of target data elements. For example, several tools supported user labeling of data, followed by predictive classification based on user annotations. Although these tools do not automatically extract data for users, they do augment human effort by (semi)automating time-consuming tasks associated with data annotation and extraction. For instance, SysRev (Bozada et al., 2021) supports researchers in labeling and extracting data by leveraging active learning models developed to replicate user decisions across various review tasks. Likewise, MetaSeer (Neppalli et al., 2016) developed ML techniques to identify and extract numbers from documents, which were then presented to users for manual annotation. Unlike question-answering models, human-computer interactions in these examples are not based on natural language queries; however, human expertise can be used to 'train' ML models to predict future annotation decisions. Similarly, to overcome the time constraints of open-ended annotation in fields that lack domain-specific dictionaries, DASyR (Piroi et al., 2015) utilized a combination of user annotations, classification models, and contextual information for populating ontologies. The authors reported a substantial reduction in annotation time, stating that through the DASyR UI "five experts added approximately 30,000 annotations at a speed of 4s/annotation" (p. 595).
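A minimal sketch of extractive question answering with a pre-trained Transformer is shown below, using the Hugging Face pipeline API; the model choice and passage are illustrative assumptions, not details of the reviewed tools.

```python
# Minimal extractive QA sketch: the model returns a span copied from
# the context, not free-form generated text.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

context = ("We surveyed 215 undergraduate students using a 12-item scale; "
           "reliability was acceptable (alpha = .81).")
result = qa(question="How many participants were surveyed?", context=context)
print(result["answer"])  # e.g., '215'
```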
In this baseline review, we did not capture techniques used for optimization, training, or fine-tuning on specific datasets or tasks. Several techniques surfaced while conducting this review, such as class modifiers (e.g., OvR; Aumiller et al., 2020), genetic algorithms (Bayatmakou et al., 2022; Torres et al., 2012), the Adam optimizer (Nowak & Kunstman, 2019; Shen et al., 2022), cross-entropy loss (Chen et al., 2020; Li et al., 2022), Universal Language Model Fine-tuning (ULMFiT; Anisienia et al., 2021), and back-propagation optimizers (Chen et al., 2020; Anisienia et al., 2021). With increasing applications of pre-trained language models that can be fine-tuned for specific applications (Jurafsky & Martin, 2024), inclusion of training and optimization approaches would provide a more comprehensive framework for reporting findings on ML/NLP approaches to data extraction. We plan to supplement future iterations of this review by capturing various optimization and training methods.
Adapted code files and results for automated search and screening for ACL, ArXiv, and DBLP full database dumps. Data are available under the terms of the Creative Commons Attribution 4.0 International Public License (CC-BY 4.0).
• Target Data Elements.docx: key elements of interest for targeted data elements
• Comprehensive List of Eligible Data Elements.xlsx: comprehensive list of elements with extraction potential per APA JARS
• Search Strategy.docx: search syntax for preliminary search in Web of Science
• APA & Cochrane Data Elements.xlsx: tabled data elements for Cochrane reviews, APA Module C (clinical trials), and APA (all study designs)
Data are available under the terms of the Creative Commons Attribution 4.0 International Public License (CC-BY 4.0).
Angrosh MA, Cranefield S, Stanger N: Contextual information retrieval in research articles: Semantic publishing tools for the research community. Semantic Web. 2014; 5(4): 261-293. Publisher Full Text
Anisienia A, Mueller RM, Kupfer A, et al.: Research method classification with deep transfer learning for semi-automatic meta-analysis of information systems papers. Proceedings of the 54th Hawaii International Conference on System Sciences. 2021; pp. 6099-6108.
Legate A, Nimon K: (Semi)automated approaches to data extraction for systematic reviews and meta-analyses in social sciences: A living review protocol. 2023a, January 12.
I am not an expert in social science research, but a few included references in Table 2 caught my eye. For example, Iwatsuki et al. (2017) about detecting in-line mathematical expressions, or Torres et al. (2012) about software engineering, or later Nayak et al. (2021) about the cotton industry?

Is the rationale for, and objectives of, the Systematic Review clearly stated? Partly
Are sufficient details of the methods and analysis provided to allow replication by others? Partly
Is the statistical analysis and its interpretation appropriate? Not applicable
Are the conclusions drawn adequately supported by the results presented in the review? Yes
If this is a Living Systematic Review, is the 'living' method appropriate and is the search schedule clearly defined and justified? ('Living Systematic Review' or a variation of this term should be included in the title.) Partly
Competing Interests:
No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard; however, I have significant reservations, as outlined above.
Thank you for your thoughtful and detailed feedback on our manuscript. We appreciate the time and effort you have invested in providing suggestions to enhance our work. We also value rigorous research methods and reporting transparency and would like to clarify several points regarding the reporting guidelines we adhered to and the journal's policies and requirements.